Advanced Graphics and Data Visualization in R

Lecture 02: ggplot2 and choosing the right visualization for your audience

0.1.0 An overview of Advanced Graphics and Data Visualization in R

“Advanced Graphics and Data Visualization in R” is brought to you by the Centre for the Analysis of Genome Evolution & Function’s (CAGEF) bioinformatics training initiative. CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. Many of the datasets and examples used in this course will be drawn from real-world datasets and the techniques learned herein aim to be broadly applicable to multiple fields.

This lesson is the second in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in R markdown notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto datatools Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week will delve into some helpful visualizations available through the ggplot2 package. First we’ll go over deciding which visualizations are best for what we want to accomplish, and then explore some of these in more detail.

At the end of this lecture you will have covered the following topics:

  1. The grammar of graphics.
  2. Scatterplots and their variants.
  3. Barplots and their variants.
  4. Density plots and their variants.
  5. Categorical plots and their variants.

0.3.0 A legend for text format in R markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced

Yellow box: Risk or caution

Green boxes: Recommended reads and resources to learn R

Red boxes: A comprehension question which may or may not involve a coding cell. You usually find these at the end of a section.


0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your RStudio folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto datatools Hub. You will need to use your UTORid credentials to complete the login process. From there you will find each week’s lecture files in the directory /2025-03-Adv_Graphics_R/Lecture_XX. You will find a partially coded skeleton.Rmd file as well as all of the data files necessary to run the week’s lecture.

Alternatively, you can download the R-Markdown Notebook (.Rmd) and data files from the RStudio server to your personal computer if you would like to run independently of the Toronto tools.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture Files

At the end of each lecture there will be a completed version of the lecture code released as an HTML file under the Modules section of Quercus.

0.4.4 Data used in this lesson

Today’s datasets will continue to focus on the Ontario public sector salary disclosure, also known as the “Sunshine list”. This list, started in 1996, discloses all public sector employees with an annual salary at or above $100,000. Although not strictly biological data, this is a great dataset to work with because it contains many observations set across a long time period, with enough data to generate subgroups based on sector, employer, and role!

You can find more information about this dataset on the Ontario public sector salary disclosure webpage.

0.4.4.1 Dataset 1: sunshineFinal.df

This dataframe will be found in the Lecture02.RData file and contains a full version of our dataset from last week. While we were unable to practice our wrangling on the data from all years, we can load a full version that has been saved in a more compact binary file!

This is a version of the Sunshine list ranging from 1996-2023. It has been altered by replacing all names with random numeric identifiers and contains nearly 2.5M observations.


0.5.0 Packages used in this lesson

tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2

viridis helps to create color-blind palettes for our data visualizations

ggbeeswarm and ggridges are new packages we need to install. They’ll help us generate some nice graphs.

# ---------- Install packages here ---------- #
# install.packages(c("ggbeeswarm", "ggridges"), dependencies = TRUE)
# ---------- Source packages here ---------- #
# Packages to help tidy our data
library(tidyverse)

# Packages for the graphical analysis section
library(viridis)

# New visualisation packages
library(ggridges)   # ridgeline plots
library(ggbeeswarm) # beeswarm plots

1.0.0 Why is data visualization important?

How do we extract meaningful insights from our data? If you have previously explored Anscombe’s quartet you’ll know that, as scientists and lay people, we can sometimes be obsessed with summary statistics - mean, median, mode, standard deviation. While these values are a helpful way to quickly assess a population, they can be flawed to the point of deception. Instead, we should temper our investigations by visualizing our data. A deeper understanding of your data trends and potential models comes from dissecting attributes of your data, which can jump out more easily through visualization.

Equally important is the ability to convey our findings to others. The right visualizations, whether simplistic or complex, should effectively communicate our key message.


1.1.0 The grammar of graphics

One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:

  • Grammar is the foundational set of rules that define the components of a language.
  • A language is built on a structure that consists of syntax and semantics.

The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework. By following a layered approach of defined components, it becomes straightforward to build a visualization.

The Major Components of the Grammar of Graphics by Dipanjan Sarkar

We can break down the above pyramid by the base components, building from the bottom upwards.

1. Data: your visualization always starts here. What are the dimensions you want to visualize? What aspect of your data are you trying to convey?

2. Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.?

3. Scale: do you need to scale/transform any values to fit your data within a range? This includes layers that map between the data and the aesthetics.

4. Geometric objects: how will you display your data within your visualization? Which geom_* will you use?

5. Statistics: are there additional summary statistics that should be included in the visualization? Some examples include central tendency, spread, confidence intervals, standard error, etc.

6. Facets: will generating subplots of the data add a dimension to our visualization that would otherwise be lost?

7. Coordinate system: will your visualization follow a classic cartesian, semi-log, polar, etc. coordinate system?
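
The seven components above map directly onto ggplot2 layers. Here is a minimal sketch using R’s built-in mtcars dataset (not our lecture data) so it runs anywhere:

```r
library(ggplot2)

p <- ggplot(mtcars) +                               # 1. Data
  aes(x = wt, y = mpg, colour = factor(cyl)) +      # 2. Aesthetics
  scale_y_log10() +                                 # 3. Scale
  geom_point() +                                    # 4. Geometric objects
  geom_smooth(method = "lm", formula = y ~ x) +     # 5. Statistics
  facet_wrap(~ am) +                                # 6. Facets
  coord_cartesian()                                 # 7. Coordinate system

p
```

Note that the order you add the layers in is flexible; what matters is that each layer answers one of the questions above.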


1.2.0 sunshineFinal.df covers 28 years of data

Let’s look at a full version of our data from last lecture. With this full dataset we can better appreciate some of the figures and visualizations we created last week but also improve upon them based on today’s lecture material.

We’ll begin by loading the data into our session from an .RData file.

# Read in sunshine data
load("./data/Lecture02.RData")

# Check the structure and preview the data
str(sunshineFinal.df)
## tibble [2,447,012 x 7] (S3: tbl_df/tbl/data.frame)
##  $ numericID      : Factor w/ 516966 levels "10000005","10000011",..: 475892 32433 382678 443816 443816 443816 443816 443816 394094 394094 ...
##  $ salary         : num [1:2447012] 194890 115604 149434 109383 173225 ...
##  $ taxableBenefits: num [1:2447012] 711 403 513 4922 4939 ...
##  $ year           : int [1:2447012] 1996 1996 1996 1996 1997 1998 1999 2000 1996 1997 ...
##  $ sector         : Factor w/ 37 levels "Colleges","Crown Agencies",..: 35 35 35 34 34 34 34 34 3 3 ...
##  $ employer       : chr [1:2447012] "Addiction Research Foundation" "Addiction Research Foundation" "Addiction Research Foundation" "Agriculture,Food And Rural Affairs" ...
##  $ title          : chr [1:2447012] "President & Ceo" "Dir., Soc. Eval. Research & Act. Dir., Clin. Research" "V.p., Research & Coordinator, Intern. Programs" "Deputy Minister" ...
head(sunshineFinal.df)

Looking at the data itself, we have:

  1. Year data ranging from 1996 to 2023
  2. 37 unique sectors in which our individuals have worked

How many unique individuals are there? How many unique job titles are listed in our dataset?

# Calculate unique individuals
sunshineFinal.df %>% 
  pull(numericID) %>% 
  unique() %>% 
  length()
## [1] 516966
# Calculate unique job titles
sunshineFinal.df %>% 
  pull(title) %>% 
  unique() %>% 
  length()
## [1] 211182

There look to be about 520K individuals and 200K unique job titles in our dataset of 2.5M entries. When we think about plotting our individual data points, that’s a lot of data!


1.2.1 Summarize your data points by grouping individuals

Before we continue on our journey, a quick look under the hood shows that when plotting datapoints, ggplot will plot each point individually! That means if we were to plot all of our observations in the sunshine list, we would be plotting some 2.5M datapoints, which can take some time to render. Instead, we will summarize our data by grouping it by individual. Along the way we can also capture a few interesting metrics, such as how long an individual has been on the list!

# 1.2.1 Generate a summary dataset for individuals across years

# Save our summary to a new variable
sunshineSummary.df <-
  # Pass along the dataset
  sunshineFinal.df %>% 
  # Group based on individual (numericID)
  group_by(numericID) %>% 
  # Summarize by capturing ranges and means of salary and taxable benefits
  summarize(minSalary = min(salary),
            maxSalary = max(salary),
            meanSalary = mean(salary),
            minTaxableBenefits = min(taxableBenefits),
            maxTaxableBenefits = max(taxableBenefits),
            meanTaxableBenefits = mean(taxableBenefits),
            numYears = n(),                                # How many years have they been on the list?
            )

# Look at the summary we've created
str(sunshineSummary.df)
## tibble [516,966 x 8] (S3: tbl_df/tbl/data.frame)
##  $ numericID          : Factor w/ 516966 levels "10000005","10000011",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ minSalary          : num [1:516966] 103334 101221 100971 100671 111280 ...
##  $ maxSalary          : num [1:516966] 122621 112936 100971 166600 168393 ...
##  $ meanSalary         : num [1:516966] 112261 107272 100971 127163 136591 ...
##  $ minTaxableBenefits : num [1:516966] 181 433 708 0 679 ...
##  $ maxTaxableBenefits : num [1:516966] 216 724 708 1030 1135 ...
##  $ meanTaxableBenefits: num [1:516966] 196 586 708 543 966 ...
##  $ numYears           : int [1:516966] 6 5 1 6 12 13 4 6 1 2 ...

Okay, it appears that we’ve decreased our overall number of entries by ~80%! That should simplify our plotting needs. We’re now ready to start visualizing our data, but how will we go about doing it?


1.3.0 Which aspects of the data do we wish to highlight?

Let’s take a look at the following chart from Dr. Andrew V. Abela.

From the flowchart above we’ll mainly explore looking at relationships and composition using our dataset as a basis for our visualizations.

We’ll cover distributions and comparison with a more complex dataset later in this lecture.


2.0.0 Visualizations help us identify relationships and correlations

Depending on the nature of your data and its dimensions there are a number of plots to choose from to represent and convey relationships between your variables. These various plots can reveal trends, correlations, and ultimately relationships between variables.

In our first lecture, we briefly explored bar charts and a vertical version of scatterplots called the strip plot. As an initial form of graphical assessment today we’ll begin with scatterplots which build a framework for exploring our data further.

Relationship description: Types of charts

  • Finding correlations in your data: Scatterplot (biplot), Strip plot, Bubble chart, Lineplot, Heatmap
  • Showing connections between groups: Arc diagram, Chord diagram, Connection map, Network diagram, Non-ribbon chord diagram, Tree diagram
  • Showing relationships, connections, or correlations between two or more variables: Heatmap, Marimekko chart, Parallel coordinates plot, Radar chart, Venn diagram

We’ll focus on just two relational plots in this section: the scatterplot and its variant, the bubblechart. Parallel coordinate plots will also make an appearance later on during this lecture.


2.1.0 Scatterplots and bubblecharts are geom_point() graphs

When we try to examine relationships, one direction we can take is to look for trends by comparing one variable against another. From an experimental standpoint we can graph our independent variable on the x-axis, looking for changes in our dependent variable/measurement on the y-axis. This is most easily accomplished when both of your variables are on a continuous scale.

When trying to show correlations in your data between two to three variables you can use scatterplots (also known as biplots). Remember there are some limitations when visualizing multi-dimensional data on a two-dimensional canvas. Overall these plots can convey to your audience the potential correlations between your variables or separations between groups based on their colour or shape.

2.1.1 Use the alpha parameter to help visualize overlapping data points

Let’s begin with a scatterplot comparing mean salaries versus the mean taxable benefits. We’ll use each individual observation as a datapoint to compare the two values to see if perhaps, there is a relationship in how one is calculated over the other. Of note, within the geom_point() layer, we’ll be updating the alpha parameter to change the transparency of our points.

Using the “alpha” parameter: Note that alpha = 1 will make completely opaque points while alpha = 0 will make completely transparent points. As points begin to overlap, they will create increasingly opaque areas in your visualization. Generally an opacity value of about 0.5-0.7 is a good place to start when working with a moderate number of overlapping points.

# 2.1.1 scatterplot of taxable benefits vs salary

sunshineSummary.df %>% 
  ggplot() +
  # 2. Aesthetics
  aes(x = meanSalary, y = meanTaxableBenefits) +
  # set text size
  theme(text = element_text(size = 10)) +
  
  # 4. Geoms
  geom_point(alpha = 0.5)

Hard to tell but it looks like perhaps as salary rises, the taxable benefit tends to decrease. Of course, we do have a major outlier in our data but that may be an anomaly, or some kind of stock split/dividend in this case.

Challenge yourself: how would you identify which of our observations this taxable benefits outlier belongs to?
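
One way to answer this is with slice_max() from dplyr. As a sketch on a toy tibble (the hypothetical toy.df stands in for sunshineSummary.df, which is what you would actually pass in):

```r
library(dplyr)

# Hypothetical mini version of sunshineSummary.df, for illustration only
toy.df <- tibble(numericID = c("A", "B", "C"),
                 meanTaxableBenefits = c(196, 98765, 708))

# Keep the row(s) with the single largest mean taxable benefit
outlier <- toy.df %>%
  slice_max(order_by = meanTaxableBenefits, n = 1)

outlier$numericID  # "B"
```

You could then filter sunshineFinal.df by that numericID to see every yearly observation for that individual.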

2.1.2 The bubbleplot adds an obvious dimensionality to your data

The bubbleplot, as we’ll see, brings an additional dimension to your visualization by varying the size of your points based on a (usually) continuous variable. This can help to highlight underlying data trends in a more obvious fashion for your audience.

To achieve this we will set the size parameter in our aesthetics aes() layer. By setting the size parameter to a variable in our data, it will alter the size of our data points. We will augment the results of this by providing a size range through the scale_size() layer.

# 2.1.2 scatterplot of taxable benefits vs salary

sunshineSummary.df %>% 
  ggplot() +
  
    # 2. Aesthetics
    aes(x = meanSalary, y = meanTaxableBenefits,
        size = numYears) +
    # set text size
    theme(text = element_text(size = 10)) +
    
    # 3. Scaling
    ### 2.1.2 set the range for sizing our dots
    scale_size(range = c(1, 5), name = "Years listed") + 
    
    # 4. Geoms
    geom_point(alpha = 0.5)


2.2.0 Scale your axes to help see your data better

While our bubbleplot does add some dimensionality to our visualization we can see a lot of values are bunched up together at the lower end of our scales and likewise we have a few values that are far out on our x/y axes. To even this out, we can transform our axes to display values on a log10 scale. We have two places where we can accomplish this:

  1. Transform your data directly, i.e. log(salary). Data will be placed on a log value scale but may have less meaning to the lay person.
  2. Transform your scales/tick marks. The data is untouched and the values still have meaning to your audience without having to do conversions in their head.
# 2.2.0 scatterplot of taxable benefits vs salary

sunshineSummary.df %>% 
  ggplot() +
  
    # 2. Aesthetics
    aes(x = meanSalary, y = meanTaxableBenefits,
        size = numYears) +
    # set text size
    theme(text = element_text(size = 10)) +
    
    # 3. Scaling
    ## set the range for sizing our dots
    scale_size(range = c(1, 5), name = "Years listed") + 
  
    ### 2.2.0 Set the x and y-axis to log scales
    scale_x_log10() + 
    scale_y_log10() + 
    
    # 4. Geoms
    geom_point(alpha = 0.2)
## Warning in scale_y_log10(): log-10 transformation introduced
## infinite values.


2.2.1 Avoid log-transforming values of 0!

From our updated bubbleplot it now becomes clear that there is an enormous range for taxable benefits regardless of salary. However, there appears to be a soft upper limit of about 10^5, even as we approach the higher ranges of salaries. We also see another small problem that comes along with log transformation of our data.

Look closely at our y-axis range and you’ll see values that are below 1x10^0. While there may be values less than 1 in our dataset, we see many datapoints at or below 1x10^-3, which is 0.001! Less than a penny! The issue here is that many of our datapoints have a meanTaxableBenefits value of 0. In R, if we attempt to take log10(0) we get -Inf because the logarithm approaches negative infinity as its argument approaches 0. This creates a value range that doesn’t actually exist in our dataset.
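
You can confirm this behaviour directly in the console with base R:

```r
log10(0)               # -Inf: the log10 of 0 is negative infinity
log10(1)               # 0: values of exactly 1 land at the origin of a log10 axis
is.infinite(log10(0))  # TRUE
```

Any -Inf values are silently pushed off a log-scaled axis, which is exactly what the warning from scale_y_log10() was telling us.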

Instead, we can make a quick replacement to our data before plotting by replacing any value less than 1 with a minimum value of 1! In this case, it won’t drastically affect our trends but it will drastically improve how our plot looks. We could be more careful and only update values of exactly 0, but this will suffice.

Note that we’ll make quick use of the if_else() function in our mutate() call.

# 2.2.1 scatterplot of taxable benefits vs salary, log-transformed without 0 values

sunshineSummary.df %>% 
  ### 2.2.1 Mutate taxable benefit data to avoid values of zero
  mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1, 
                                       # set any value below 1 to exactly 1
                                       true = 1, 
                                       false = meanTaxableBenefits)) %>% 
  
  # 1. Data
  ggplot() +
  
    # 2. Aesthetics
    aes(x = meanSalary, y = meanTaxableBenefits,
        size = numYears) +
  
    # set text size
    theme(text = element_text(size = 10)) +
    
    # 3. Scaling
    # set the range for sizing our dots
    scale_size(range = c(1, 5), name = "Years listed") + 
  
    ## Set the x and y-axis to log scales
    scale_x_log10() + 
    scale_y_log10() + 
    
    # 4. Geoms
    geom_point(alpha = 0.2)

While we cannot conclude that more years working equates to higher benefits in this graph, we can see that generally higher salaries will have higher taxable benefits.

How do you interpret log-transformed data?: After log-transforming both axes it looks like the relationship between salary and taxable benefits is leaning towards linear. While the transformation has definitely improved the visualization, it has made the interpretation a little harder. Overall, in this situation, you’ll either want to model against the untransformed data or do the math correctly and recognize that the relationship is a power function:

\[\begin{equation*} \begin{aligned} \text{log} y &= B + a \text{log} x &\text{where B is the intercept of the vertical axis and a is the slope}\\[10pt] y &= 10^{(B + a \text{log} x)}\\[10pt] y &= 10^{B} 10^{a \text{log} x}\\[10pt] y &= 10^{B} x^{a} &\text{Recall that } y \text{ln} x = x^{y}\\[10pt] \end{aligned} \end{equation*}\]


2.3.0 Subsetting your data for a better view

While we can gather some information here, there are still perhaps too many data points for a proper interpretation. Some approaches we could take include random subsetting, subsetting from a block or binning our data through filtering.

2.3.1 Quickly slice() your data

One approach to quickly subsetting your data is the slice() function, which can take a start:end argument that keeps rows start through end, inclusive. You can also use it to quickly drop specific rows from your dataset. In fact, there are many ways you can use slice(), and also many variants of it in the dplyr package.
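
A few quick illustrations of slice() and its variants on a toy tibble (the names here are hypothetical, not from our dataset):

```r
library(dplyr)

toy.df <- tibble(id = 1:5, value = c(10, 50, 30, 20, 40))

slice(toy.df, 2:4)                          # rows 2 through 4, inclusive
slice(toy.df, -1)                           # drop row 1, keep the rest
slice_head(toy.df, n = 2)                   # first 2 rows
slice_max(toy.df, order_by = value, n = 1)  # row with the largest value
```

slice_sample(), used below, belongs to this same family and draws rows at random instead of by position.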

Let’s rerun our plot code but randomly slice 1000 datapoints instead using the slice_sample() function. Note that we must explicitly assign the number of samples to slice with the parameter n.

# 2.3.1 Slice and use 1000 random datapoints for our figure

# Set a seed for random sampling reproducibility
set.seed(2025)

sunshineSummary.df %>% 
  # Mutate taxable benefit data to avoid values of zero
  mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1, 
                                       # add 1 to all mean taxable benefit values of 0
                                       true = 1, 
                                       false = meanTaxableBenefits)) %>% 
  ### 2.3.1 Randomly slice 1000 samples
  slice_sample(n=1000) %>% 
  
  ggplot() +
  
    # 2. Aesthetics
    aes(x = meanSalary, y = meanTaxableBenefits,
        size = numYears) +
    # set text size
    theme(text = element_text(size = 10)) +
    
    # 3. Scaling
    ## set the range for sizing our dots
    scale_size(range = c(1, 5), name = "Years listed") + 
  
    ## Set the x and y-axis to log scales
    scale_x_log10() + 
    scale_y_log10() + 
    
    # 4. Geoms
    geom_point(alpha = 0.2)


Looks like we still see a slightly rising pattern of taxable benefits with increasing salary, at least when we get into the very high ranges!

2.3.2 Bin your data with filtering!

Whereas with slice_sample() we were pulling observations from our data fairly indiscriminately, what if we wanted to compare a specific data range? For example, we could examine salary values within a smaller range, like 125-150K, and see how widely taxable benefits can vary. To accomplish this, you need to combine predicates within a filter() call.

# 2.3.2 Filter your data into a binned group

sunshineSummary.df %>% 
  # Mutate taxable benefit data to avoid values of zero
  mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1, 
                                       # set any value below 1 to exactly 1
                                       true = 1, 
                                       false = meanTaxableBenefits)) %>% 
  
  ### 2.3.2 Filter for salaries between 125K and 150K
  filter(meanSalary >= 125000, meanSalary <= 150000) %>% 
  
  ggplot() +
  
    # 2. Aesthetics
    aes(x = meanSalary, y = meanTaxableBenefits,
        size = numYears) +
    # set text size
    theme(text = element_text(size = 10)) +
    
    # 3. Scaling
    ## set the range for sizing our dots
    scale_size(range = c(1, 5), name = "Years listed") + 
  
    ## Set the x and y-axis to log scales
    scale_x_log10() + 
    scale_y_log10() + 
    
    # 4. Geoms
    geom_point(alpha = 0.2)

Looking at a narrow range of all salaries between $125,000 and $150,000, it appears that the overall range of taxable benefits remains the same. One could argue, however, that the distribution of taxable benefits does appear to shift slightly with increasing salary. The datapoints, however, are far too dense to determine this, and eyeballing isn’t the most accurate way to determine the distribution of our data.

2.3.3 Add a trendline to your data with the geom_smooth() layer

Depending on the density of your datapoints, one common method of visualizing a trend in your data is to add a line of best fit. While this often requires further interpretation on its own, it is a quick way to visually assess if there is any relationship between two variables on your biplot.

We can easily add a fitted line (aka a model) to our plot using the geom_smooth() layer. Based on your expectations, you can set the method parameter to a specific function to generate your model:

  • lm: linear models drawing a simple y = ax+b line
  • loess: local polynomial regression fitting, which determines the slope of the line at each point \(x\) based on local/surrounding datapoints!

You can read more about your method options at the tidyverse website.

For now, let’s add a linear model to our plot and see if that gives us any more perspective. At the same time, we’ll also move one of our aes() settings from the aes() layer to directly in the geom_point() layer. We’ll talk more about why in Lecture 03!

# 2.3.3 Add a line of best fit

sunshineSummary.df %>% 
  # Mutate taxable benefit data to avoid values of zero
  mutate(meanTaxableBenefits = if_else(meanTaxableBenefits < 1, 
                                       # set any value below 1 to exactly 1
                                       true = 1, 
                                       false = meanTaxableBenefits)) %>% 
  
  # Filter for salaries between 125K and 150K
  filter(meanSalary >= 125000, meanSalary <= 150000) %>% 
  
  ggplot() +
  
    # 2. Aesthetics
    aes(x = meanSalary, y = meanTaxableBenefits) +
    # set text size
    theme(text = element_text(size = 10)) +
    
    # 3. Scaling
    # set the range for sizing our dots
    scale_size(range = c(1, 5), name = "Years listed") + 
  
    # Set the x and y-axis to log scales
    scale_x_log10() + 
    scale_y_log10() + 
    
    # 4. Geoms
    geom_point(alpha = 0.2, aes(size = numYears)) +
  
    ### 2.3.3 Add a line of best fit
    geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Definitely a weak upward trend in our slope. From here, you could investigate and generate the actual linear model to check how significant this relationship is! It would also be great to take a closer look at the distribution of our data on a visual level, but how do we accomplish that?
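
That investigation would use lm() and summary() directly. As a sketch on simulated power-law data (the constants here are assumed values for illustration; the real call would use the filtered columns of sunshineSummary.df):

```r
# Simulate data roughly following y = 10^B * x^a with multiplicative noise
set.seed(2025)
x <- runif(500, min = 125000, max = 150000)
y <- 10^(-2) * x^0.9 * 10^rnorm(500, sd = 0.05)

# Fit the linear model on the log10-transformed values,
# mirroring the log-log axes of our plot
fit <- lm(log10(y) ~ log10(x))

# The log10(x) row gives the slope estimate, its standard error, and p-value
summary(fit)$coefficients
```

The slope of this log-log fit is the exponent a from the power-function equation in section 2.3.0, so its p-value tells you whether the upward trend is distinguishable from a flat line.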


3.0.0 Distribution plots display frequency and spread across groups or intervals

Many students are familiar with the classic bar chart. Thus far, we’ve already used it to some extent to look at our last lecture’s sunshine data. When comparing groups or populations for differences in their distribution, however, the bar chart falls quite short of the ideal.

When comparing distributions, we again are often concerned with summary statistics - mean, median, standard deviation, confidence intervals. The nature of our data, however, is often not continuous across an entire y-axis interval as a bar chart may suggest but rather our population is the result of values centred, with some variance, around a mean value.

Over the remainder of the lecture we will compare and contrast 4 types of distribution plots for their relevance and when they may be most useful to visualize our data. When applicable to our data visualizations, we will also display all of our data points to better illustrate our distribution versus the visualization. We will examine:

  1. Bar charts or Barplots
  2. Density plots
  3. Box and Whisker plots
  4. Violin plots

3.1.0 The barplot uses the geom_bar() or geom_col() to display data

As we have already seen, the geom_bar() layer can be used to display our data in multiple ways. The height of the bar is proportional to either:

1. The number of observations of each group or

2. The sum of weights applied by a weight aesthetic

In our Lecture 01 examples we actually used the value of our observations (which was a summary data table) to determine height and proportions of our groups by setting the parameter stat="identity". This was the application of a weighting based on the values of the y-axis variable we chose.

A simpler way to accomplish this effect is with geom_col(), which already uses stat_identity() by default to calculate proportions.

  • geom_bar(): counts observations in your data and (by default) sets bar height proportional to those counts. Accepts only an x OR y parameter in aes().
  • geom_col(): uses y-axis values as the height of each bar. Requires both x AND y parameters in aes().

Both of these geoms use position="stack" by default, so bars sharing the same x position are stacked and their heights sum. Such groups can instead be displayed side by side using position="dodge" or position="dodge2". This is only helpful, however, when the number of x values (i.e. categories) is low; otherwise the graph becomes crowded.
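
A quick sketch of the difference, on a tiny hypothetical dataset (not our sunshine data):

```r
library(ggplot2)

toy.df <- data.frame(sector = rep(c("A", "B"), each = 2),
                     year   = factor(rep(c(2022, 2023), times = 2)),
                     total  = c(4, 6, 3, 7))

# Default: bars for 2022 and 2023 are stacked within each sector
p_stack <- ggplot(toy.df) +
  aes(x = sector, y = total, fill = year) +
  geom_col()

# Dodged: bars for 2022 and 2023 sit side by side within each sector
p_dodge <- ggplot(toy.df) +
  aes(x = sector, y = total, fill = year) +
  geom_col(position = "dodge")

p_dodge
```

Stacking emphasizes the group totals; dodging makes within-group comparisons easier.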

Let’s begin with something simpler. We’ll go back to our original dataset, sunshineFinal.df and filter for just a subset of our sectors (non-“Ministry/seconded” data) as well as narrow down our scope to just salaries paid in 2023.

We’ll try to convey the total salary amount paid to each sector for that year as a comparison to see which is “largest” using only the geom_col() layer.

# 3.1.0 build a simple barchart of total salary per sector in 2023

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023)) %>% 
  
  # 1. Data
  ggplot(.) +
      # 2. Aesthetics
      aes(x = sector, y = salary) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      # 4. Geoms
      ### 3.1.0 Add our bars
      geom_col()   


3.1.1 Use reorder() to sort your data as you plot it

In the case of sunshineFinal.df we can do a little better by ordering our x-axis of sector by total salary. Since our data table is quite simple there are a number of ways to accomplish a sort:

  1. Convert sector to a factor and use fct_reorder() to sort our factors.
  2. Use the reorder() function while generating the plot.

Let’s sort our sector by total salary in descending order using our second choice reorder(). This takes a few parameters as follows:

  • x: an atomic vector (usually a factor but not required) that you want to reorder

  • X: a vector of the same length as x whose element-matched values will be used to determine the order of x

  • FUN: the function you want to apply to X to reorder the values of x

Note that the output of this function will be a vector of type factor for your values x.
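As a quick, self-contained illustration (toy vectors, not the lecture data) of how reorder() and the negation trick behave:

```r
# Toy data: three groups with different sums
grp  <- c("a", "a", "b", "b", "c")
vals <- c(1, 2, 10, 10, 5)

# Levels are ordered by ascending FUN(X): sums are a = 3, c = 5, b = 20
f <- reorder(grp, vals, FUN = sum)
levels(f)   # "a" "c" "b"

# Negating X flips the order, giving a descending sort
levels(reorder(grp, -vals, FUN = sum))   # "b" "c" "a"
```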

# 3.1.1 reorder your barplot on the fly

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023)) %>% 
  
  # 1. Data
  ggplot(.) +
      # 2. Aesthetics
      ### 3.1.1 Use reorder() to alter how our data is displayed on the x-axis
      aes(x = reorder(x = sector, 
                      X = -salary,  # Note the "-" here essentially reverses our salary values!
                      FUN = sum), 
          y = salary, fill = sector) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      # 4. Geoms
      geom_col()    ## Add our bars


3.2.0 Barplots can help guide the eye

From above, using the reordered barplots we can quickly gauge the difference between sectors as they get sorted by total salary payout. There’s not much need to colour by sector as well, but it does add some emphasis to “School Boards” and reinforces the fact that we have sorted by total salary.

What are we missing from this visualization that would help the reader? It would be nice to know how many individuals make up each sector in that year. Knowing that size would tell us a bit more about the nature of the data we are seeing. Then we could answer: are “School Boards” simply a much larger pool of individuals, hence their higher total salary?

3.2.1 Use the fill parameter to show quantitative data

One simple way to enhance our barplot is through the use of fill colour. By setting the fill parameter based on the number of observations in each sector we can shade each bar for a visual assessment of total individuals. The simplest way to accomplish this is to summarise() our data by sector before passing it along for visualization.

# 3.2.1 shade bars by a continuous value

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>% 
  # Group based on sector
  group_by(sector) %>% 
  # Summarize to get total salary and population size
  summarise(totalSalary = sum(salary),
            sectorSize = n()) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = sector, 
          y = totalSalary, 
          fill = sectorSize) +  ### 3.2.1 Set the fill colour to a variable!
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
      # 4. Geoms
      geom_col()


3.2.2 Add extra geom_* layers to more accurately reflect values

As you can see from above the shading does help us determine that our largest overall payout does go to the largest group. However, unless you have a very good eye for shades, we can’t see the full detail of our sector sizes. We can still see that “School Boards” has a much higher number of employees in that sector but how similar in size are “Universities” through to “Legislative Assembly And Offices” on our figure?

Instead of fill colour, we can add a little detail by layering an additional geom. In this case, we will use geom_point() to add the population of each sector after scaling it 1:100,000.

Why scale sector size up by 100,000? This is a decision based on what we already know about the data. Our y-axis totalSalary values scale from 0 to 8x10^9, while the populations of our sectors range from 0 to 80000, so it makes sense to try and get our values onto the same scale.

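A back-of-the-envelope version of that reasoning (round numbers taken from the paragraph above):

```r
# totalSalary tops out near 8e9; sector sizes top out near 8e4
max_salary <- 8e9
max_size   <- 8e4

# Scaling the sizes 1:100,000 puts both variables on the same axis scale
max_size * 1e5   # 8e9, the same order of magnitude as max_salary
```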
# 3.2.2 add points to represent sector size

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>% 
  # Group based on sector
  group_by(sector) %>% 
  # Summarize to get total salary and population size
  summarise(totalSalary = sum(salary),
            sectorSize = n()) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = sector, 
          y = totalSalary) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text

      guides(colour = guide_legend(title="Sector\nsize *10^5")) +

      # 4. Geoms
      geom_col() +
      
      ### 3.2.2 Add points to represent sector size
      geom_point(aes(y=sectorSize*100000, colour="red"), size = 5)


3.2.3 The lollipop plot: a sweet twist on the barplot.

Now we have more clarity on population sizes between sectors, answering our question about the difference in sector sizes. We can see, for instance, that the “Colleges” sector has more individuals than “Ontario Power Generation”, suggesting that the mean salary per individual is higher in the latter group! To publish this figure we would want to fix a few additional things, like the legend presentation, the axis names and titles, and perhaps spruce up the colours. We’ll drill into these ideas more during our next lecture!

On this same topic, let’s visit one last plot that can use some pieces from the two variants of barplots we’ve used. The lollipop graph emphasizes that each x-axis category has a single y value, rather than spanning a range, while still visually connecting those y-axis values to their x-axis categories.

It looks very much like it sounds, and we’ll add an extra twist to ours by setting the point size to the sector size, giving it a bit more informational dimension. To accomplish this visualization we’ll combine a geom_point() with a geom_segment().

Note that the geom_segment() layer requires pairs of both the x and y values as arguments (x, xend, y, and yend).

Aesthetics when working with multiple geoms: As you’ll see in the following code, we’ve made some adjustments due to working with multiple geom_* layers. Since there can be overlapping aesthetics between geoms, you need to be cognizant of their effects. Rather than set these parameters in the aes() layer, set them directly in their respective geoms.

# 3.2.3 Generate a lollipop chart

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>% 
  # Group based on sector
  group_by(sector) %>% 
  # Summarize to get total salary and population size
  summarise(totalSalary = sum(salary),
            sectorSize = n()) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = sector, 
          y = totalSalary) +
  
      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
      
      # Use this layer to set the size range of your datapoints
      scale_size(range = c(1, 10), name = "Sector\nsize") + 

      # 4. Geoms
      ### 3.2.3 Make the stick of the lollipop
      geom_segment(aes(x=..., xend=..., 
                       y=..., yend=...)) +
      
      ### 3.2.3 Add points to represent sector size
      geom_point(aes(size = ...), colour = "orchid")
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomSegment, : '...' used in an incorrect context

3.3.0 Barplots to convey proportion or composition

Let’s revisit a version of our graphs from the end of last lecture. We’ll use sunshineFinal.df to begin exploring a slightly bigger picture for our data, looking for trends that spanned over all the years of our datasets.

In our first lecture we looked at the other helpful visualization that barplots can produce: composition and proportions. Here we combine the observations for each sector to help convey an added dimension to our data.

The key to our following figure is in the aes() layer where we set:

  • x: the x-axis values here are usually categorical or discrete in nature

  • fill: by nature our bars will stack by some categorical variable in our data

You’ll notice, however, that we do not set a y-axis value. This is because geom_bar(), by default, counts the observations of each subsection within each bar.

# 3.3.0 simple bar chart of sector size vs year

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = year, fill = sector) +

      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text

      # 4. Geoms
      # Start with a simple bar graph
      ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

3.3.1 Change the positioning of subgroups in your data with the position parameter of geom_bar()

Taking a quick look at the overall result, there’s a lot of information here. From this graph, we can quickly see just how many salaries have climbed above $100K over nearly 3 decades: we have gone from fewer than 10K individuals to nearly 300K in 2023. The problem with our figure, however, is that because it has scaled so drastically, we cannot really pull much meaningful subgroup proportion information from 1996 the way we can from the much more distinct 2023 bar.

Instead, let’s take advantage of the geom_bar() parameter position, which can be set via the following position adjustment functions:

  • position_stack(): the default, and what we see above, where each fill category is stacked one upon another

  • position_dodge(): this separates each fill category into its own separate small bar

  • position_fill(): this gives the proportion of the whole for each subgroup at every x-position on the figure, essentially normalizing each x-category within itself.

Let’s give position_fill() a try and see how that updates our figure
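Under the hood, position_fill() normalizes the counts within each x value so they sum to 1; base R’s prop.table() mimics the arithmetic (toy data, not the lecture dataset):

```r
# One row per observation, two "years" with different group mixes
df <- data.frame(year = c(1, 1, 1, 2, 2),
                 grp  = c("a", "a", "b", "a", "b"))

# Count per (year, grp), then normalize within each year (row margin)
props <- prop.table(table(df$year, df$grp), margin = 1)

rowSums(props)   # each year's proportions sum to 1, as in a filled bar
```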

# 3.3.1 simple bar chart of sector size vs year

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = year, fill = sector) +

      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text

      # 4. Geoms
      ### 3.3.1 present each sector as a proportion of the whole for each year on the x-axis
      geom_bar(position = ...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

What a difference that single parameter can make. Now, we can see more interesting details like:

  • A major change to proportions in sector sizes between 2002 and 2003. This likely reflects a re-organization of the sectors overall. “Hospitals” pretty much disappears and “Hospitals And Boards Of Public Health” appears to replace it. “Ontario Public Sector” is likely broken down into other sectors, as it virtually disappears in 2003 as well.

  • 2021-2023 does see an uptick in the proportion of “School Boards” positions. From our current graph, however, we must recall that everything is calculated as a proportion, so what we see is on a relative scale.


3.3.2 Setting our x-limits with xlim()

From above, to avoid some of the strange proportioning we get from 1996-2002, you have a couple of easy options for trimming your data:

  1. Use the filter() layer to limit the dates you’d like to see.

  2. Change the x-axis limits using the xlim() layer. For this, you simply need to supply the left and right limits as values. Note that the order of your values does matter! You can achieve similar control with the y-axis and the ylim() layer.

Filtering versus axis limits: When you filter() your data, you may be looking for observations in a specific range (ie 1-10) but only get observations returned that exist as a subset within that range (ie 4-9). When you plot your data, depending on the data type ggplot will only plot to the upper (and sometimes lower) range of your data. In our current example, you would expect to see an axis ranging 1-10 based on your filter, but ggplot might only produce 0-9 or even 4-9!

In the case of xlim() or ylim(), this sets a hard range on the data to be displayed, whether or not you have observations that fall outside of that range! Any data outside the specified range(s) will simply not be plotted, so you may miss unexpected data! However, in a series of figures, setting the same x- or y-axis limits can make your visualizations more consistent and uniform looking. Having the same y-axis limit, for instance, can also help your readers better compare two experiments, making the effect of your different conditions much more visible.

Know your geoms, axis types and their limits: A quick caution here that depending on the type of geom you are using and the data type for your axis, ggplot can inadvertently cut your data off. Due to the nature of barplots, for instance, if your x-axis is numeric as a data type, then each vertical bar is centred on a value +/- 0.5. If your limit is set at the same value as a column, then said column will not be rendered using default geom parameters.
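The clipping caution comes down to simple arithmetic; assuming the default bar width of 0.9 on a numeric axis:

```r
# A bar drawn at x = 2003 with width 0.9 spans 2003 +/- 0.45
bar_x <- 2003
width <- 0.9
c(left = bar_x - width / 2, right = bar_x + width / 2)

# An x-limit placed exactly at 2003 therefore slices through this bar,
# hence the advice to pad numeric limits slightly
```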

# 3.3.2 trimmed bar chart of sector size vs year

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = year, fill = sector) +

      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
      
      # 3. Scaling
      ### 3.3.2 Limit our x-axis to between 2003 and 2023 (why do we add 1 on each side?)
      ... +
  
      # 4. Geoms
      # present each sector as a proportion of the whole for each year on the x-axis
      geom_bar(position = position_fill())
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

3.3.3 Alter the width of your stacks and dodge your barplots!

Some last things to keep in mind when working with your barplots: you are not explicitly required to stack them. You can, in fact, split your subgroups into a side-by-side comparison. In this case, we could see each sector as its own bar for each year! This makes the most sense when you are working with smaller x-axis ranges and can give a bit of a different spin on how your plots look. To accomplish this, use the position parameter and set it to position_dodge() or position_dodge2() instead. The latter is especially helpful when dodging geom_boxplot() layers, which may have variable widths. We’ll get more into that later!

At the same time, we’ll explore the width parameter (default value = 0.9) which allows you to alter the thickness of your bars. Sometimes this just allows you to add more space between your bars if it feels overcrowded, but can also come in handy when dodging your barplots. You can also use this parameter to make your bars overlap!

# 3.3.3 Dodging your barplots

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>% 
  
  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = year, fill = sector) +

      theme(text = element_text(size = 20)) + # set text size
      theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
      
      # 3. Scaling
      # Limit our x-axis to between 2003 and 2023 (why do we add 1 on each side?)
      xlim(2002, 2023) +
  
      # 4. Geoms
      ### 3.3.3 Set your position parameter
      geom_bar(position = ..., width = ...)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

While we could probably also see some of these trends with a lineplot, we can now see that, year over year, each sector has been growing in the number of individuals over the 100K mark. The only exception we can quickly spot is the judiciary, which has held pretty steady for many years, likely due to the fixed size of the court system and the number of judges available! We also see that School Boards have failed to grow over the last 3 years, although a deeper dive could reveal any number of factors: a steady state of retirements balancing out promotions and wage increases, a freeze on hiring concurrent with a wage freeze, or some combination of those possibilities!

Note how altering the width parameter doesn’t change the distance between our mini-bars but it does alter the distance between our groups by year!


3.4.0 Convert your plots to a circular layout with coord_polar()

Generally speaking there are a number of circular or polar coordinate plot variants ranging in complexity from the pie chart to a racetrack chart. The coord_polar() layer takes a few parameters:

  • theta: determines if angles will be mapped to the x or y variable
  • start: offset from the “12 o’clock” position
  • direction: 1 = clockwise, -1 = counter-clockwise
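A minimal sketch of these parameters in action, using toy data (assumes ggplot2 is loaded; not the lecture dataset):

```r
library(ggplot2)

# A single stacked bar becomes a pie once angles are mapped to y
toy <- data.frame(part = c("a", "b", "c"), value = c(1, 2, 3))

pie <- ggplot(toy, aes(x = "", y = value, fill = part)) +
  geom_col() +
  coord_polar(theta = "y", start = 0, direction = 1)
```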

Here’s a summary

Chart name Plot description Uses Cartesian analog
Pie chart Sliced up proportions of a circle Simple proportion comparison between groups single bar variant
Nightingale/Coxcomb plot Intervals are equal wedges but shaded areas represent proportions Intervals with additional categorical proportions Stacked column plot, theta=x
Race track plot Intervals are split concentrically to form rings of equal width Helpful if some groups are less diverse Stacked column plot, theta=y

3.4.1 Pie charts work best with simple data

From our pie chart description and from your own experience, you can recall that pie charts aren’t helpful with complex data. Once the number of categories surpasses 6-8 groups, things can get hard to visually digest - especially if one category dominates the others.

Let’s continue with sunshineFinal.df but focus in again on 2023 to get a breakdown of total salary payout by sector. We’ll group and summarise our data again before passing it along to ggplot(). The key here will be converting a geom_bar() barplot to a pie chart using the coord_polar() layer, whose theta parameter can be set to the x- or y-axis.

To accomplish this feat it helps to think about how the data is being handled. If you think about a pie chart, you can imagine it much like a stacked barplot where the top and bottom have been curved around to meet each other, so we are connecting our bar along the y-axis! This is a plot all about proportions; instead of having an x-axis defined, it shows the proportions for a single category or group.

Therefore, we set the aes() parameter x to a value of "" so there are no categorical values for x but you will still need to set a y value from which to determine those proportions.

Note that we are introducing two common layers to help with titles in our figure: ggtitle() and guides(). We won’t discuss these in detail here, but they basically add titles to your plot and legend respectively.

We will also briefly introduce scale_fill_viridis_d() to our figures. This layer draws on the viridis family of palettes, which are colour-blind friendly and also translate well into accurate grey-scale gradients!

# 3.4.1 Building a simple pie chart

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year == 2023) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum)) %>%
  # Group based on sector
  group_by(sector) %>% 
  # Summarize to get total salary and population size
  summarise(totalSalary = sum(salary),
            sectorSize = n()) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = "", y = totalSalary, fill = sector) +
  
      theme_void() + # This will remove all background theme objects
      theme(text = element_text(size = 14)) + # set text size
    
      guides(fill = guide_legend(title="Public Sector")) +
      ggtitle("Proportion of total salaries paid by sector in 2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete colour scale

      # 4. Geoms
      # Set up our barplot here
      geom_bar(stat="identity", 
               colour = "white") +   # Set the colour between segments
    
      # 7 Coordinate system
      ### 3.4.1 Map angles to the Y-axis
      ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

3.4.2 Nightingale rose charts emphasize proportions

Made famous by Florence Nightingale, these plots (which she named coxcomb plots) were used to help emphasize the proportion of soldier deaths due to infection during the Crimean War. Each circular increment is equally spaced to represent increasing values along the y-axis. Visually, this gives more area representation as we radiate outwards on the chart. This produces a chart that

  1. Makes distinguishing larger proportions more obvious.
  2. Adds a new dimension to the pie chart, allowing us to create intervals for comparison.
  3. Helps connect our start and end category, especially if they don’t represent an ordinal or continuous scale.

Florence Nightingale’s original rose chart/coxcomb plot publication. Sourced from the Wikimedia commons

Know your y-limits! Outer segments disproportionately represent increases in value: their larger area draws more visual emphasis despite the linear scale of the y-axis. Because of this, you need to consider the range of your data.

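That warning is just the geometry of a circular sector, whose area grows with the square of the radius:

```r
# Area of a wedge with angle theta and radius r is (theta / 2) * r^2
theta <- pi / 6   # a 30-degree wedge
wedge_area <- function(r) theta / 2 * r^2

# Doubling the y value (radius) quadruples the printed area
wedge_area(2) / wedge_area(1)   # 4
```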
Let’s extend our pie chart into a coxcomb chart where we’ll explore data from 2011 to 2023.

We’ll add in the xlab() and ylab() layers so we can rename our x- and y-axis titles as well. You may also notice a couple of extra calls to the theme() layer and we’ll spend plenty of time next lecture discussing how to best uses these layers to customize your plots!

# 3.4.2 Building our first coxcomb plot

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2011:2023),
         as.integer(sector) %in% c(1:7)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum),
         year = factor(year)) %>%
  # Group based on sector
  group_by(sector, year) %>% 
  # Summarize to get total Salary and population size
  summarise(totalSalary = sum(salary),
            sectorSize = n()) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = year, y = sectorSize, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
    
      guides(fill = guide_legend(title="Public Health Unit")) +
      xlab("Year") + # Set the x-axis label
      ylab("Sector Size") + # Set the y-axis label
      ggtitle("Sector size 2011-2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete colour scale

      # 4. Geoms
      geom_col(width = 0.9) + # Squeeze our bars together

      # 7 Coordinate system
      ### 3.4.2 Map angles to the x-axis
      coord_polar(theta = ...)
## `summarise()` has grouped output by 'sector'. You can override using
## the `.groups` argument.
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

Looking at our Nightingale plot, we can see how much emphasis is put on larger portions within each stacked bar. Just be careful, because:

  1. Too many x-axis categories can overcrowd your plot

  2. Having a large range of values on your y-axis (especially between x-axis categories) can also make it very hard to read/distinguish the smaller parts of your stack.


3.4.3 Racetrack plots are an aesthetic variant of the bar chart

Want to make another cool chart from a boring bar chart? Stacked or otherwise, you can convert those bar plots to a racetrack or radial bar chart. The only difference between the coxcomb plot and the radial bar chart is that instead of the default theta="x" we set it to "y", much like flipping the coordinates of a bar chart.

Beware the racetrack transformation: This is an aesthetic transformation where each bar gets relatively longer as you move from the inside out. Equal values span equal angles, so bars must be judged by angle rather than by length! Our eyes can more precisely judge differences on a Cartesian bar chart, so when viewing these types of charts, don’t be misled by how they look at a casual glance!

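The distortion can be quantified: a given value always subtends the same angle, but the drawn arc length grows linearly with the ring’s radius (toy numbers):

```r
# A value of 25 out of 100 subtends the same angle on every ring
total <- 100
value <- 25
angle <- value / total * 2 * pi

arc_length <- function(radius) radius * angle
arc_length(3) / arc_length(1)   # the outer ring draws the same value 3x longer
```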
Let’s make a standard bar chart of our data and compare it to the racetrack plot. Note that negative values in a dataset are plotted separately; this is an implementation quirk of ggplot.

# 3.4.3 Generate a coord-flipped cartesian plot of our last figure

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2011:2023),
         as.integer(sector) %in% c(1:7)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum),
         year = factor(year)) %>%
  # Group based on sector
  group_by(sector, year) %>% 
  # Summarize to get total Salary and population size
  summarise(totalSalary = sum(salary),
            sectorSize = n()) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = year, y = sectorSize, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
    
      guides(fill = guide_legend(title="Public Health Unit")) +
      xlab("Year") + # Set the x-axis label
      ylab("Sector Size") + # Set the y-axis label
      ggtitle("Sector size 2011-2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete colour scale
      ### 3.4.3 Flip our bars to the y-axis
      ... +

      # 4. Geoms
      geom_col(width = 0.9) # Squeeze our bars together
## `summarise()` has grouped output by 'sector'. You can override using
## the `.groups` argument.
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

Now just imagine what our figure would look like if we could connect the right side of the bars to the left side!

# 3.4.3 Switch up the plot to a racetrack plot

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2011:2023),
         as.integer(sector) %in% c(1:7)) %>% 
  # Move our reordering of sectors into the data wrangling
  mutate(sector = reorder(x = sector, X = -salary, FUN = sum),
         year = factor(year)) %>%
  # Group based on sector
  group_by(sector, year) %>% 
  # Summarize to get total Salary and population size
  summarise(totalSalary = sum(salary),
            sectorSize = n()) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = year, y = sectorSize, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
    
      guides(fill = guide_legend(title="Public Health Unit")) +
      xlab("Year") + # Set the x-axis label
      ylab("Sector Size") + # Set the y-axis label
      ggtitle("Sector size 2011-2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete colour scale

      # 4. Geoms
      geom_col(width = 0.9) + # Squeeze our bars together

      # 7 Coordinate system
      ### 3.4.3 Map angles to the y-axis
      coord_polar(...)
## `summarise()` has grouped output by 'sector'. You can override using
## the `.groups` argument.
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

3.4.4 Barplots and their variants summarize the distribution of data between groups or intervals but NOT within single populations.

Often for our purposes as scientists, we are more concerned with the distribution of replicates or measurements as a population within a group whilst also comparing across other groups (ie control vs treatment). Naively, we are tempted by the convenience of Excel to simplify our data as a bar chart with some simple standard deviation or error bars. One word concisely describes this behaviour:

\[\text{Inappropriate}\]

Note from our examples that although we are comparing sectors, our datapoints represent single or total observations separated either by group or by time. At each x-axis point we are not comparing the distribution between categories in a meaningful way. In fact, from our first example, we can convey a near-exact message using a dot-plot! In essence these visualizations are communicating the categorization of samples across multiple groups. More or less, this is about visualizing proportions.

Section 3.0.0 Comprehension Question: After exploring the racetrack variant above, visualizing our data ranging from 2011 to 2023, what is the biggest issue you might see? What would be one way to remedy this? Is this an appropriate visualization of our data?

# Code your plot here!

4.0.0 Density plots compute a theoretical distribution

What is the shape of our data? When we encounter experimental populations and begin sampling or measuring for specific characteristics, we inevitably begin to reveal whether or not that characteristic has a predictable range, median value, etc. Density plots, once given enough sample values, begin to predict the shape or distribution of that sampling. We sometimes use this to represent the theoretical distribution of our original populations.

4.0.1 Distribution plots need appropriate data

In contrast to our previous datasets, we are interested in dissecting group characteristics after sampling multiple times. In this case, we have an interesting dataset for comparison because we can look at the distribution of sunshine salaries across each sector in our data. Are some sectors paid more or less or have a wider distribution of salaries?

We have many observations we can compare in each group (some sectors of course are larger than others) but these datapoints can give us a window into how our public servants are paid.

4.1.0 Density and histogram plots model distribution of samples or individuals within a population

While bar plots help to focus on proportional representation for categorical data, both density plots and histograms can be used to convey the frequency or distribution of values for a variable across its specified range. When comparing distributions visualized this way, you can usually compare up to 3 or 4 on the same plot before the data becomes too crowded. These methods also give you a sense of your data before moving forward with more detailed analyses. You can also identify possible mistakes or artefacts in data collection.

How much data do I need? Density plots can be thought of as smoothed versions of histograms which have been binned in small intervals. Density plots of a single dimension require a minimum of 4 samples, but justifying a KDE on a sample size that small is hard. Histograms are recommended to have at least 30 observations to be considered useful, and I would apply this rule of thumb to KDEs as well.
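
The smoothing relationship can be seen directly by overlaying the two geoms. This is a minimal sketch on toy data (not the sunshine dataset): the histogram is placed on the density scale so the KDE sits on top of it.

```r
# Toy illustration: a histogram on the density scale with its smoothed KDE overlaid
library(ggplot2)

set.seed(42)
toy.df <- data.frame(salary = rnorm(200, mean = 120000, sd = 20000))

ggplot(toy.df, aes(x = salary)) +
  geom_histogram(aes(y = after_stat(density)),  # put bars on the density scale
                 bins = 30, fill = "grey80", colour = "grey40") +
  geom_density(colour = "firebrick")            # the smoothed KDE counterpart
```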

Let’s pick a few sectors to plot based on salary. We’ll re-filter our data to look at just a portion of sectors much like our nightingale plots for just the year of 2023. In this figure we’ll introduce 2 geoms:

  • geom_density(): this is the layer we can use to create a kernel density estimate. In theory, the area under each curve should be 1, as it represents the probability density of salary values across the range of the data.

  • geom_rug(): this simple geom will sit outside our plot in what would be considered the “margins” of the plot. Each vertical line will represent an actual observation from the data.

# 4.1.0 Generate a kernel density estimate with a rug

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023),
         sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = salary, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
    
      guides(fill = guide_legend(title="Public sector")) +
      xlab("Salary") + # Set the x-axis label
      ylab("Density") + # Set the y-axis label
      ggtitle("Sector pay distribution for 2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete colour scale
      scale_colour_viridis_d() +
      xlim(50000, 500000) +

      # 4. Geoms
      ### 4.1.0 Add a density plot
      ... + # Plot the density estimates
      ### 4.1.0 Add a rug plot to the x-axis margin
      ...(aes(colour = sector))
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

Looking at our 4 chosen sectors, we see some interesting density plots.

  • Colleges have a very tight distribution with a very steep drop-off around $150K

  • Universities, on the other hand, have a much longer tail leading outward, suggesting a higher salary range, but also a more even distribution along that range. However, the highest peak in density sits at a smaller salary than that of the College group.

  • Hospitals and Boards of Public Health also have a similar mode for salary as Universities but with a much tighter range for the bulk of its distribution.

  • Judiciary is our most interesting set of data as it represents a bimodal distribution. There are two distinct groupings of salaries in the Judiciary system: a “lower” group around $180K and a “higher” group around $350K.


4.1.1 Apply the after_stat() function to scale your data

There are a number of after_*() functions that can be applied to your plots. The overall purpose of these is to influence the aesthetics parameters (i.e., x, y, fill, colour, etc.) based on post-stat transformations of your data. For instance, we know that geom_bar() counts up observations for each category on the x-axis. You can use after_stat(count) to extract the total observations per x-axis category as a computed variable.

Looking carefully at the documentation of each geom_*(), you will find which computed variables are calculated by each specific geom. For geom_density() we can extract density, number of points, and a scaled density estimate.
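
As a minimal sketch on toy data (not the sunshine dataset), here are two of geom_density()'s computed variables mapped through after_stat(): the default density, and the scaled variant whose peak always sits at 1.

```r
# Toy illustration of after_stat() with geom_density()'s computed variables
library(ggplot2)

set.seed(1)
toy.df <- data.frame(x = rexp(100, rate = 1/50000))

# Default: the curve's area integrates to 1
ggplot(toy.df, aes(x = x)) +
  geom_density(aes(y = after_stat(density)))

# Scaled: the curve's maximum height is rescaled to 1
ggplot(toy.df, aes(x = x)) +
  geom_density(aes(y = after_stat(scaled)))
```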

Let’s look at using after_stat(scaled) to set our y-axis range of values in our previous geom to between 0 and 1.

# 4.1.1 normalize or scale our KDEs

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023),
         sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = salary, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
    
      guides(fill = guide_legend(title="Public sector")) +
      xlab("Salary") + # Set the x-axis label
      ylab("Scaled density") + # Set the y-axis label
      ggtitle("Sector pay distribution for 2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete fill scale
      scale_colour_viridis_d() + # the "d" stands for discrete colour scale
      xlim(50000, 500000) +

      # 4. Geoms
      ### 4.1.1 scale the KDE plot
      geom_density(aes(y = ...), alpha = 0.5) + # Plot the density estimates
  
      ### Add a rug plot to the x-axis margin
      geom_rug(aes(colour = sector))
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomDensity, : '...' used in an incorrect context

By setting the after_stat() option in our density plot, now all of our densities have a maximum height of 1. This can make it easier to see all of our densities at the same height. This is especially helpful when some are very skewed or have very long tails making them much “lower” without scaling. Now you can more easily estimate the overall distribution of your population along certain points.


4.1.2 Apply a facet_*() to view KDEs separately

As you can see from above, if we were to have more and more sectors, it may be better to separate them out in order to avoid too much overlap. It may also be to our advantage to compare them in a more vertical fashion. Recall that there are two layers we can choose from: facet_wrap() and facet_grid(). They are differentiated by the following characteristics:

  • facet_wrap(): Data is split based on available data combinations of the facets parameter which can be defined by a formula like ~var1+var2 or by using the quoting function vars(var1, var2). In either case, the data is then grouped by your variables. Panels are placed in a single ribbon that wraps around based on the arguments in the ncol and nrow parameters.
  • facet_grid(): Data is split based on the 1 or 2-dimensional combination of facet variables. Even combinations for which there is no data will be displayed as empty panels. Use the rows and cols parameters to set the facets by using the vars() quoting function. Note that you could further group your rows and/or columns by multiple variables.

Note we’ll also use another parameter in both called scales, where we can determine whether panels should share axis scales or set them based on individual panel data in the y-axis (free_y), x-axis (free_x), or both (free).
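
The contrast between the two layers can be sketched with ggplot2’s built-in mpg dataset (a stand-in for our data here):

```r
# Minimal contrast of the two facet layers using the built-in mpg data
library(ggplot2)

# facet_wrap(): a single ribbon of panels, wrapped into one column
ggplot(mpg, aes(x = hwy)) +
  geom_density() +
  facet_wrap(vars(drv), ncol = 1, scales = "free_y")

# facet_grid(): a full rows-by-columns grid of panels
ggplot(mpg, aes(x = hwy)) +
  geom_density() +
  facet_grid(rows = vars(drv), cols = vars(year), scales = "free_y")
```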

Let’s first use facet_wrap() to generate a single-column facet of KDEs based on sector pay for 2023 of our dataset.

# 4.1.2 Switch up the plot to a faceted KDE plot

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023)) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = salary, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
    
      guides(fill = guide_legend(title="Public sector")) +
      xlab("Salary") + # Set the x-axis label
      ylab("Density") + # Set the y-axis label
      ggtitle("Sector pay distribution for 2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete fill scale
      xlim(50000, 400000) +

      # 4. Geoms
      ### 4.1.2 Note how we no longer need to scale our y-axis?
      geom_density(alpha = 0.5) + # Plot the density estimates
  
      # 6. Facets
      ### 4.1.2 Add a simple facet wrap, with each sector group existing in its own row
      facet_wrap(..., 
                 scales=..., 
                 ncol=1)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

4.1.3 Use facet_grid() to view multiple periods of our data

If, for example, we wanted to directly compare the data from two years (2015 and 2023), we could alter the above plot so that our two years are panelled beside each other. This can be accomplished with the facet_grid() layer, which allows us to name both a rows and a cols grouping parameter.

Let’s update our figure so that we split our rows by sector and our columns by year.

# 4.1.3 Switch up the plot to a grid of faceted KDEs

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2015, 2023)) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = salary, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
    
      guides(fill = guide_legend(title="Public sector")) +
      xlab("Salary") + # Set the x-axis label
      ylab("Density") + # Set the y-axis label
      ggtitle("Sector pay distribution 2015 vs. 2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete fill scale
      xlim(50000, 400000) +

      # 4. Geoms
      ### 4.1.3 Note how we no longer need to scale our y-axis?
      geom_density(alpha = 0.5) + # Plot the density estimates
  
      # 6. Facets
      ### 4.1.3 Add a simple facet grid, with each sector group existing in its own row 
      # and two columns of year data
      facet_grid(rows = ...,
                 cols = ...,
                 scales="free")
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

While the above visualization is pretty clear, it certainly takes up a lot of extra space. Perhaps there is a better way to generate this kind of visualization?

4.2.0 Plot multiple distributions with a ridgeline plot

Ridgeline plots (sometimes called joyplots) offer a compact way to show multiple distributions of numeric values across several groups. The distributions can be shown as either histograms or density plots, with all of them aligned to the same horizontal axis. The vertical axis is compressed slightly to generate an overlap between categories.

To generate these visualizations we can use the ggridges package which is an extension of ggplot2. In this case, that means it uses the same grammar and can be added as a layer call to geom_density_ridges(). A parameter to keep in mind:

  • scale: sets the vertical distance between ridgelines

    • 1: the tallest density curve just touches the baseline of the next vertical category

    • Above 1: increasing overlap

    • Below 1: increasing separation
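
The effect of scale can be previewed on toy data (assuming the ggridges package is installed); with a value above 1, adjacent ridges deliberately overlap.

```r
# Toy illustration of geom_density_ridges() and its scale parameter
library(ggplot2)
library(ggridges)

set.seed(7)
toy.df <- data.frame(group = rep(LETTERS[1:4], each = 100),
                     value = rnorm(400, mean = rep(1:4, each = 100)))

ggplot(toy.df, aes(x = value, y = group, fill = group)) +
  geom_density_ridges(scale = 1.5,   # > 1, so each ridge overlaps the one above it
                      alpha = 0.7)
```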

Let’s begin by replicating our faceted plot from above which compares the 2015 vs 2023 salary distributions across sectors.

# 4.2.0 Generate our ridgeline plot to compress our KDEs

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2015, 2023)) %>% 
         # sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = salary, y = sector, fill = sector) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
      theme(legend.position = "none") + # drop the legend since it's redundant
    
      guides(fill = guide_legend(title="Public sector")) +
      xlab("Salary") + # Set the x-axis label
      ylab("Sector") + # Set the y-axis label
      ggtitle("Sector pay distribution 2015 vs. 2023") +
    
      # 3. Scaling
      scale_fill_viridis_d() + # the "d" stands for discrete fill scale
      xlim(50000, 400000) +

      # 4. Geoms
      ### 4.2.0 Note how we no longer need to scale our y-axis? Instead we can scale for overlap
      ...(scale = ..., 
                          alpha = ...) + 
  
      # 6. Facets
      ### 4.2.0 Add a facet grid, splitting columns by year
      facet_grid(~year,
                 scales="free_x")
## Error in ...(scale = ..., alpha = ...): could not find function "..."

4.3.0 Use geom_density_ridges_gradient() to fill densities on a gradient

From above you can now see that we’ve somewhat compacted all that dimensional data into a single plot that still clearly conveys the difference in overall salary distribution within sectors. The distributions across our categories suggest that the sectors whose distributions change the most over the past decade or so are the Colleges and School Board sectors.

For our audience, we would need to clean up our axis titles to clarify that these proportions are calculated independently. An additional option with your ridgeline plots is the fill variant. Right now we are using scale_fill_viridis_d() to generate a discrete colour fill for each of our density plots. To accomplish a nicer horizontal gradient we will change the layer call to scale_fill_viridis_c() since our x-axis is continuous. Keep in mind that we also cannot set the alpha transparency on our density plots when filling with a gradient. We also have to set our fill aesthetic to after_stat(x) to accomplish this feat.

# 4.3.0 Generate our ridgeline plot with a horizontal gradient

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2015, 2023)) %>% 
         # sector %in% c("Colleges", "Universities", "Judiciary", "Hospitals And Boards Of Public Health")) %>% 

  # 1. Data
  ggplot() +
      # 2. Aesthetics
      aes(x = salary, y = sector, fill = after_stat(x)) +
  
      theme_bw() + # make the theme more plain
      theme(text = element_text(size = 20)) + # set text size
      theme(panel.grid.major.y = element_line(color="grey60")) + # darken our major y grid
      theme(legend.position = "none") + # drop the legend since it's redundant
    
      guides(fill = guide_legend(title="Public sector")) +
      xlab("Salary") + # Set the x-axis label
      ylab("Sector") + # Set the y-axis label
      ggtitle("Sector pay distribution 2015 vs. 2023") +
    
      # 3. Scaling
      ### 4.3.0 Use the Magma option 
      ...(name="Salary", option = "C") + 
      xlim(50000, 400000) +

      # 4. Geoms
      ### 4.3.0 Switch our geom to the gradient version
      ... + # Plot the density estimates
  
      # 6. Facets
      # 4.3.0 Add a facet grid, splitting columns by year
      facet_grid(~year,
                 scales="free_x")
## Error in ...(name = "Salary", option = "C"): could not find function "..."

By altering how we fill our density estimates, we emphasize the x-axis range: cooler colours for lower salaries and “hotter” colours for upper salary ranges. What this really highlights at a quick glance is the Judiciary sector. The salaries remain bimodal but they have clearly shifted over the decade by ~$50K. This is happening for both groups in the distribution, while we see a somewhat smaller shift in our other sectors.

Of course there are other ways that you could compare these two sets of distributions as density plots. How would you approach overlapping salary distributions for different years atop each other?


5.0.0 Categorical distribution plots

Did you know the boxplot is nearly 50 years old? It was first introduced in the 1970s by our favourite statistician, John Wilder Tukey, and we’ll dig into how and when to use this iconic plot. While we’re here we’ll also take a look at other categorical distribution plots. While our KDE and ridgeline plots provide quite a bit of detail, they can also be limited in their space efficiency. The following categorical distribution plots will perhaps pack more information into less space.

5.1.0 Summarize population distributions with geom_boxplot()

As you can see from the previous section, we comfortably fit quite a few distributions on a ridgeline plot. From the looks of it, school boards had one of the tightest distributions while the Judiciary sector had a strange bimodal split amongst its individual salaries.

Can we visualize the data in a more summarized form? Let’s explore the box plot.

The dissection of a boxplot’s components shows us how it summarizes data distribution.

Also known as the box and whisker plot, this visualization conveys the distribution of samples within a group or population and is built upon 5 principal values:

  • “Box”

    • median: dark line across centre of box

    • lower quartile: lower-bound of box (aka lower hinge)

    • upper quartile: upper-bound of box (aka upper hinge)

  • “Whiskers”

    • lower extreme: length of lower whisker
    • upper extreme: length of upper whisker

Together, the lower and upper quartiles produce the interquartile range (IQR). The general implementation of boxplots classifies any observation more than 1.5 × IQR above the upper quartile or below the lower quartile as an outlier of the distribution. The characteristics of outliers can be set as parameters within the geom_boxplot() layer. Parameters include outlier.shape, outlier.size, and outlier.colour.
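
The outlier arithmetic is easy to check by hand in base R. This is a toy sketch (geom_boxplot() computes its hinges with Tukey's method, which can differ slightly from quantile() on small samples):

```r
# Hinges, IQR, and the 1.5 * IQR outlier fences on toy salaries (in $K)
x <- c(52, 55, 57, 60, 61, 63, 64, 70, 95)

hinges <- quantile(x, probs = c(0.25, 0.75))  # lower and upper quartiles
iqr    <- IQR(x)                              # upper hinge minus lower hinge
fences <- c(lower = hinges[[1]] - 1.5 * iqr,
            upper = hinges[[2]] + 1.5 * iqr)

x[x < fences["lower"] | x > fences["upper"]]  # values flagged as outliers
## [1] 95
```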

Unlike a histogram, a boxplot can be generated from as few as 5 values. While you could generate a boxplot on fewer values, you might not have actual whiskers! This is definitely a great alternative when sample sizes are between 5 and 30 for each population.

Historically this was a simple way to visualize summary statistics of populations while being easy to produce by hand. Of course, with the age of computing, the production of kernel density estimates has allowed for more diverse visualizations. This plot, however, remains a popular format and thus is more readily understood by general audiences.

Each compact box can take up the same space as a barplot column but it gives much more information about the population. Let’s look at a single aspect - salary in the year of 2023 across our usual sectors.

# 5.1.0 Generate a boxplot of 2023 salaries by sector

sunshineFinal.df %>%
  # Filter out the "Ministry" datapoints
  filter(str_detect(string = sector, pattern = "Ministry:", negate = TRUE),
         year %in% c(2023)) %>% 

  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=sector, y = salary) +

    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
    # 3. Scaling
    ylim(0, 1000000) +

    # 4. Geoms
    ### 5.1.0 Add the boxplot geom
    ... 
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

5.2.0 Upgrade your boxplot with some confidence intervals and values

From the looks of it, we can confirm what we saw before in our density plot - that most of the sectors have a pretty tight distribution with longer tails trailing to the higher salaries. Unsurprisingly, Judiciary again shows a different pattern with a much longer boxplot and no outliers reported. We’ll see why in a few sections.

A boxplot can also be updated with a few extra items to help us visualize the data:

  • add data points with geom_jitter() and remove outliers from the boxplot to avoid double-plotting points.

  • mark a ~95% confidence interval around the median within the shape of each box plot with the notch parameter.

  • you can set a variable width for each boxplot with the varwidth parameter, based on the number of samples used to generate the plot. The wider the box, the more samples were used to generate it.

One thing you have to consider, however, is that when you have too many data points, plotting them can be very dense. Too many points and you can’t really discern any additional information outside of the original boxplot!
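
These upgrades can be sketched together on the built-in ToothGrowth dataset (a small toy stand-in, so the points stay readable):

```r
# Notched, variable-width boxes with raw points jittered on top
library(ggplot2)

ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot(notch = TRUE,          # ~95% confidence interval around the median
               varwidth = TRUE,       # box width reflects group sample size
               outlier.shape = NA) +  # suppress outliers; geom_jitter() draws all points
  geom_jitter(width = 0.2, alpha = 0.5)
```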

For our next example, we’ll reel in the number of observations by focusing on salaries in the University sector way back in 1996. We’ll plot salaries by employer in this case and see how the different institutions fared against each other ~30 years ago.

# 5.2.0 Generate a boxplot of university salaries from 1996

sunshineFinal.df %>%
  # Filter for University data from 1996
  filter(sector == "Universities", year == 1996) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=employer, y = salary) +
    
    # Set some configurations for the plot
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
    theme(legend.position = "none") + # drop the legend since it's redundant
  
    # 3. Scaling
    ylim(0, 300000) +

    # 4. Geoms
    geom_boxplot(outlier.shape=...,  ### 5.2.0 Remove outliers 
                 notch = ...) +    ### 5.2.0 Add a confidence interval notch
  
    ### 5.2.0 Add datapoints
    ...(aes(colour = ...), alpha = 0.5)
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

5.3.0 geom_beeswarm takes plotting your points to the next level

As you can see from above, the geom_jitter() layer adds points to our boxplot by offsetting each point with a small amount of random noise, which reduces (but does not eliminate) overlap. Points are restricted to the width of the boxplot, although this can also be adjusted to some degree with the right parameters. geom_jitter() is native to the ggplot2 package, with parameters that allow for a more “random” distribution of your data points within a provided area.

The goal of the ggbeeswarm package is to generate points that will not overlap while also approximating the kernel density of your data. The package supplies two geoms that work with ggplot2 to accomplish this. The first is geom_beeswarm(), which has a number of parameters that control its aes() mappings but also how the points are laid out:

  • priority: determines the method used to perform the point layout.

    • options include: ascending, descending, density, random, and none.
  • groupOnX: if TRUE then jitter is added to the x axis (default behaviour) otherwise jitter along the y axis.

  • dodge.width: the amount by which different aesthetics groups (must be a factor) will be dodged.

  • show.legend: determines whether this layer should be included in the legends.

    • options include: NA (yes if aesthetics are mapped), FALSE (never include), and TRUE (always include)
  • cex: an optional parameter that can be used to help set the spacing between values.
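
A couple of these parameters can be previewed on toy data (assuming the ggbeeswarm package is installed):

```r
# Toy illustration of geom_beeswarm() layout parameters
library(ggplot2)
library(ggbeeswarm)

set.seed(3)
toy.df <- data.frame(group = rep(c("A", "B"), each = 80),
                     value = c(rnorm(80, 10), rnorm(80, 12)))

ggplot(toy.df, aes(x = group, y = value)) +
  geom_beeswarm(priority = "density",  # lay out the densest points first
                cex = 1.5)             # widen the spacing between points
```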

You may also notice a number of warnings for our plot that our notches went outside the hinges. This means our notches and confidence intervals extend beyond the upper or lower quartiles. This is likely due to a wide distribution with only a few data points and can result in a strange-looking box. Let’s turn the notches back off for now.

# 5.3.0 Generate a boxplot of university salaries from 1996 with geom_beeswarm

sunshineFinal.df %>%
  # Filter for University data from 1996
  filter(sector == "Universities", year == 1996) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=employer, y = salary) +
    
    # Set some configurations for the plot
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
    theme(legend.position = "none") + # drop the legend since it's redundant
  
    # 3. Scaling
    ylim(0, 300000) +

    # 4. Geoms
    geom_boxplot(outlier.shape=NA,  # Remove outliers 
                 notch = FALSE) +   ### 5.3.0 Remove the confidence interval notch
  
    ### 5.3.0 Add datapoints as a beeswarm
    ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

5.3.1 geom_quasirandom() adds KDE structure to your plotting

As we can see from above, the geom_beeswarm() layer adds a little more structure to the data in a somewhat rigid way. Any datapoints that are near each other are deliberately spaced out to almost represent the distribution of your data. Of course, you may run into some issues as your number of datapoints increases or as your data range increases. We see this problem very clearly in our previous plot.

To remedy this, we can balance the visualization a bit with the geom_quasirandom() function. geom_quasirandom() works similarly to the geom_beeswarm() function with emphasis on an additional method of how the points are plotted:

  • method: determine the method for distributing the points. Options include:
    • quasirandom, pseudorandom: generates a KDE before plotting points in a violin shape
    • smiley, frowney: generates a KDE, then points are binned with maximum bin values plotted on the outside or inside respectively
    • tukey: plotted more as dotstrips in a box-plot style.
  • varwidth: if TRUE, vary the width of each group based on the number of samples in the group

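As a minimal sketch on toy data (again assuming ggbeeswarm is installed), here are the method and varwidth parameters in action:

```r
# Toy illustration of geom_quasirandom() layout options
library(ggplot2)
library(ggbeeswarm)

set.seed(9)
toy.df <- data.frame(group = rep(c("A", "B"), each = 150),
                     value = c(rnorm(150, 10), rexp(150, 1/5)))

ggplot(toy.df, aes(x = group, y = value)) +
  geom_quasirandom(method = "quasirandom",  # violin-shaped KDE layout (the default)
                   varwidth = TRUE)         # spread tracks group sample size
```
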
# 5.3.1 Generate a boxplot of university salaries from 1996 with geom_quasirandom

sunshineFinal.df %>%
  # Filter for University data from 1996
  filter(sector == "Universities", year == 1996) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=employer, y = salary) +
    
    # Set some configurations for the plot
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
    theme(legend.position = "none") + # drop the legend since it's redundant
  
    # 3. Scaling
    ylim(0, 300000) +

    # 4. Geoms
    geom_boxplot(outlier.shape=NA,  # Remove outliers 
                 notch = FALSE) +   # Remove the confidence interval notch
  
    ### 5.3.1 Add datapoints as a quasirandom beeswarm
    ...
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

Now our data points take a more nuanced approach with a uniform width that shapes the data as a distribution. We’ll see this with even more emphasis in a few sections.

When given the need to show data distribution, try the quasirandom plotting of points over a simple beeswarm.


5.4.0 Add a third dimension to your boxplot with nested boxplots

Is there more to the data we’ve visualized? For our dataset, a logical question to ask next would be how do these salaries compare across different years. We’ve done the same previously with our KDE plots so let’s try it with boxplots.

The kind of boxplot we will create is known as a nested boxplot. Technically speaking, up to this point, our multi-boxplot visualizations have been “grouped” boxplots because we are exploring a single variable (salary) across multiple groups within another variable (University employers). Now we will further nest within the employer variable, using a secondary variable (year) to break each employer into subgroups. We could do this with 2 or more years but we’ll use just 2 for now.

5.4.1 Use n_distinct() to find the number of unique values in a group

When we compare 1996 to 2023 data, we need to keep in mind that a few things can happen. Universities can come and go or get amalgamated with other universities. In some cases, names can change too! To avoid these issues, we’ll do a simple series of filtering to look at Universities that are “unchanged” in name from 1996 to 2023.

First, we’ll filter our data to capture 1996 and 2023 data from Universities. We’ll group_by() and summarise() using a new function n_distinct() to count the number of years in the group. It will be either 1 or 2 and we only want to keep Universities with both 1996 and 2023 data. We’ll wrap it in a conditional to create a logical variable called keep in our summary. From this, we can create our curated list of Universities.
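
The same keep logic can be sketched on a toy tibble (hypothetical employer names, not the real sunshine data):

```r
# Toy illustration: keep only employers present in both years
library(dplyr)

toy.df <- tibble(employer = c("U1", "U1", "U2", "U3", "U3"),
                 year     = c(1996, 2023, 1996, 1996, 2023))

toy.df %>%
  group_by(employer) %>%
  summarise(keep = n_distinct(year) == 2) %>%  # TRUE only if both years appear
  filter(keep) %>%
  pull(employer)
## [1] "U1" "U3"
```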

# 5.4.1 Make a curated list of Universities from both 1996 and 2023

universityList <-
  sunshineFinal.df %>% 
  filter(sector == "Universities", 
         year %in% c(1996, 2023)) %>% 
  # Group by employer (ie University)
  group_by(employer) %>% 
  # Are there observations for both years?
  summarise(keep = ...) %>% 
  # Keep only the observations with 2 years
  filter(keep == TRUE) %>% 
  # Grab the list of employers
  pull(employer)
## Error in filter(., keep == TRUE): '...' used in an incorrect context
# Show the list
universityList
## Error in eval(expr, envir, enclos): object 'universityList' not found

So it looks like we’ve narrowed our list of University employers to a set of twelve.

5.4.2 Grouping boxplot on the fill parameter

From there we’ll use the fill parameter to generate different subgroups in our boxplot to make a grouped boxplot. In doing so, we’ll also make use of the geom_quasirandom() parameters:

  • dodge.width: separate the data points by any aesthetic groups that have been assigned.

  • width: set the maximum spread of your data points within each grouping

  • alpha: set the opacity of your data points so you can see more of the overlapping data.

Remember how we said to avoid plotting all points when you have too many? In this case, we can plot all of our points in a cleaner way than geom_jitter() by using a geom_quasirandom() to confine all our datapoints to a smaller area.

# 5.4.2 Generate a grouped and faceted boxplot

sunshineFinal.df %>%
  # Filter for University data from 1996 and 2023
  filter(sector %in% c("Universities"), 
         year %in% c(1996, 2023),
         employer %in% ...) %>%        ### 5.4.2 filter on our list of employers
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=employer, 
        y = salary, 
        fill = ...) +                    ### 5.4.2 Set the fill colour based on year!
    
    # Set some configurations for the plot
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
    # 3. Scaling
    ylim(0, 500000) +

    # 4. Geoms
    geom_boxplot(outlier.shape=NA,  
                 notch = FALSE) +   
  
    ### 5.4.2 Add datapoints as a quasirandom beeswarm
    geom_quasirandom(dodge.width = 0.7,     # How far apart to set our groups
                     width = 0.15,          # How wide to make each beeswarm
                     alpha = 0.05)          # Set the opacity very low since we expect a lot of points!
## Error in `filter()`:
## i In argument: `employer %in% ...`.
## Caused by error:
## ! '...' used in an incorrect context

In our boxplots, we are plotting the data in both a boxplot and dotplot format. The shape of the dots helps to give a better sense of salary distribution within each subgroup. You can see that the data points overlay the box but also fall outside it. Can we get the compact nature of the boxplot while still getting the visual appeal generated by a density plot?


5.5.0 Violin plot - the lovechild of density and boxplots

As the title says, the violin plot is a mixture of both the boxplot and kernel density plot. It’s a miniaturized KDE that is mirrored across its axis. It encompasses the summary information of the boxplot but in a pear- or violin-shaped distribution. To generate a geom_violin() in ggplot, a minimum of three values is required. To justify using a violin, I would again suggest sticking to a similar rule of thumb of a minimum of ~30 samples/observations to ensure an accurate representation of the distribution.

The nuanced visualization of a violin plot conveys much more information than the boxplot itself, and most boxplots can be replaced with a violin plot. Despite being a gateway to more detailed distribution information, this format remains less popular and less familiar to scientists, so its immediate accessibility to your audience can be limited.

An important parameter of this geom is scale, which sets how big each violin is relative to the others. It accepts the following values:

  • area: the default; all violins have the same area.

  • count: areas are scaled proportionally to the number of observations.

  • width: all violins have the same maximum width; total area is not the same for each.

Outliers and violins: much like a KDE, the theoretical distribution of a violin plot can generate some impossible values - especially at the tails. Remember that this is a theoretical distribution based on the data supplied. Depending on how much variation is in your data, and how many outliers it has, the shape of your violin can be strongly affected.
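To see what the scale options do, here is a minimal sketch on synthetic data (toy.df, the group names, and the group sizes are made up for illustration; it assumes ggplot2 is loaded):

```r
library(ggplot2)

# Two made-up groups of very different sizes
set.seed(42)
toy.df <- data.frame(
  group = rep(c("bigGroup", "smallGroup"), times = c(300, 30)),
  value = c(rnorm(300, mean = 10), rnorm(30, mean = 12))
)

# With scale = "count", the small group's violin shrinks to reflect
# its 30 observations; swap in "area" (default) or "width" to compare
ggplot(toy.df, aes(x = group, y = value)) +
  geom_violin(scale = "count")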

# 5.5.0 Generate a violin plot

sunshineFinal.df %>%
  # Filter for University data from 1996 and 2023
  filter(sector %in% c("Universities"), 
         year %in% c(1996, 2023),
         employer %in% universityList) %>%        # filter on our list of employers
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=employer, y = salary, fill = factor(year)) +
    
    # Set some configurations for the plot
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
    # 3. Scaling
    ylim(0, 500000) +

    # 4. Geoms
    ### 5.5.0 Add the violin geom
    ... + 
  
    # Add datapoints as a quasirandom beeswarm
    geom_quasirandom(dodge.width = 0.85,     # How far apart to set our groups
                     width = 0.15,          # How wide to make each beeswarm
                     alpha = 0.05)          # Set the opacity very low since we expect a lot of points!
## Error in `filter()`:
## i In argument: `employer %in% universityList`.
## Caused by error in `employer %in% universityList`:
## ! object 'universityList' not found

Looking at our violin plot we can see just how closely the geom_quasirandom() points match our violin shape. Depending on the density of your observations, you could easily drop the data points to make things a little cleaner.


5.6.0 Violin plots represent distributions and boxplots summarize them

The major advantage of the violin plot is that, by its nature, it is very sensitive to the distribution that produces the density estimate. The boxplot represents the summary information of a distribution but is always drawn as if it summarized a normal distribution. There are not enough parameters supplied to represent anything more complex!

The violin plot is not limited in that respect. Despite some of its visual caveats, it can certainly detect multi-modal data. Let’s make a toy example to illustrate using the Judiciary data we saw back in section 4.3.0.

# 5.6.0 Generate a combined violin and boxplot of bimodal data

sunshineFinal.df %>%
  # Filter for Judiciary data
  filter(sector %in% c("Judiciary"), 
         year %in% c(2008:2010, 2020:2023)) %>%
  mutate(year = factor(year)) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=salary, y = year) +
    
    # Set some configurations for the plot
    theme_bw() +
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
    # 3. Scaling
    xlim(0, 500000) +

    # 4. Geoms
    geom_violin(scale = "width", colour="darkviolet") + 
    geom_boxplot(alpha = 0.6, colour = "blue") + 
    geom_quasirandom(alpha = 0.6, colour = "darkgreen")
## Orientation inferred to be along y-axis; override with
## `position_quasirandom(orientation = 'x')`

In our toy model with the Judiciary data, we can see that there really are two distinct populations; the datapoints and violins both mirror this. Looking closely at the boxplot, however, we only see one long distribution. Without plotting the datapoints, you would not know whether the data are spread evenly over the boxplot’s area (remember, it contains the middle 50% of the data) or hiding a bimodal distribution.

Take-home message: while boxplots are great and easy to interpret on the surface, always ask to see the datapoints or a violin instead!


5.7.0 Combine violin and boxplots into the ultimate plot

From our above example, you can see that we blended a number of geoms together. With a little extra work, we can also plot both violins and boxplots together in a multivariate setting! This gives us the familiarity of the boxplot but also clearly displays the theoretical distribution. Some steps to accomplish this:

  1. Set the scale parameter to “width” to put emphasis on the violin plots.
  2. Adjust some of the boxplot parameters so the boxplots fit within the violin plots.
  3. Set some aes() parameters on our geoms directly to ensure our points are plotted correctly.
  4. Drop the geom_quasirandom() layer since our violins mirror the same shape so well.
# 5.7.0 Generate a combination violin plot with boxplot

sunshineFinal.df %>%
  # Filter for University data from 1996 and 2023
  filter(sector %in% c("Universities"), 
         year %in% c(1996, 2023),
         employer %in% universityList) %>%        # filter on our list of employers
  mutate(year = factor(year)) %>% 
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=employer, y = salary) +
    
    # Set some configurations for the plot
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
    # 3. Scaling
    ylim(0, 500000) +
    scale_colour_manual(values=c("black", "black")) + # we'll need this to fix our boxplots

    # 4. Geoms
    ### 5.7.0 Add the violin geom
    geom_violin(scale = "width", aes(...)) + 
  
    ### 5.7.0 Boxplot but smaller width so they reside "within" the violin plot. 
    # Do you think the order matters?
    geom_boxplot(aes(...), width=0.2, 
                 position = position_dodge(width=0.9), 
                 outlier.shape=NA) # Remove the outliers
## Error in `filter()`:
## i In argument: `employer %in% universityList`.
## Caused by error in `employer %in% universityList`:
## ! object 'universityList' not found

Section 5.7.0 Comprehension Question: Looking at the above code for our graph, why do you think we set the aes(fill/group = year) aesthetic individually for some layers rather than directly in the top aes() layer?


5.8.0 Parallel coordinate plots can help visualize multivariate data

While the above visualization shows the total salary distribution across some of our universities with some clarity, the messiness of outliers has invaded some of the violin plots themselves. An experienced eye, however, can still see that the bulk of the population centres around our internal boxplots, offset by the stretched-out look of our violins. We could, of course, clean it up by removing some of the outliers ahead of time, but I generally keep outliers as a visual reminder that experiments aren’t perfect!

Suppose, however, we wanted to add more sectors to our data, like “Colleges”, across a larger timespan, like 15 years. With so many potential employers in each sector, things would start to get very crowded. To accommodate all of that data, you could facet it into multiple groups based on the sectors, for instance, but this would separate the datasets from each other. Sometimes things look sharper when you can see them all on a single plot.
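If you did want to try the faceting route mentioned above, a hedged sketch might look like the following (it reuses the sunshineFinal.df pipeline and column names from earlier in this lecture, and assumes the tidyverse and ggplot2 are already loaded):

```r
# Split the plot into one panel per sector via facet_wrap();
# each panel gets its own set of x-axis groups
sunshineFinal.df %>%
  filter(sector %in% c("Universities", "Colleges"),
         year %in% c(2010:2023)) %>%
  ggplot(.) +
    aes(x = factor(year), y = salary) +
    geom_boxplot(outlier.shape = NA) +
    ylim(0, 500000) +
    facet_wrap(~ sector)   # one panel per sector
```

This keeps each sector readable, at the cost of separating them onto different axes - which is exactly the trade-off the parallel coordinate plot avoids.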

Organizing multiple groups, across multiple categories is the domain of the parallel coordinate plot. To accomplish this feat we’ll need to do a couple of key things:

  1. We’ll need to group and summarize our data after filtering. We’ll gather the mean salary for each employer (within a sector) at each year.
  2. Set a group parameter made from multiple variables. Normally, the group parameter can only be set to a single variable (column) in your data. We can, however, make a new variable “on the fly” using the interaction() function and supply multiple variables from our dataset to create this new interaction group. This is an extremely helpful way to subgroup your data, much like using a group_by() call within the ggplot() call itself.
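As a quick illustration of what interaction() returns (the toy vectors below are made up for this example):

```r
# interaction() combines factors into a single factor,
# with one level per combination of the supplied variables
sector   <- c("Universities", "Universities", "Colleges")
employer <- c("UofT", "McMaster", "Seneca")

interaction(sector, employer)
# a factor with values like Universities.UofT, Universities.McMaster, Colleges.Seneca
```

Each employer-within-sector combination becomes its own group, which is exactly what we need for one line per employer below.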
# 5.8.0 Generate a parallel coordinate plot

sunshineFinal.df %>%
  # Filter for University and College data from 2010 to 2023
  filter(sector %in% c("Universities", "Colleges"), 
         year %in% c(2010:2023),
         ) %>%
  mutate(year = factor(year)) %>% 
  # Group the data
  group_by(sector, employer, year) %>% 
  # Summarise to get the mean salary for each group
  summarise(meanSalary = mean(salary)) %>%
  # Revert back to a regular data frame
  ungroup() %>%
  
  # 1. Data
  ggplot(.) +
    # 2. Aesthetics
    aes(x=year, y = meanSalary, colour = sector) +
    
    # Set some configurations for the plot
    theme(text = element_text(size = 20)) + # set text size
    theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5)) + # Adjust our x-axis text
  
    # 3. Scaling
    ylim(100000, 200000) +

    # 4. Geoms
    ### 5.8.0 A line plot to join the points
    geom_line(aes(group = ...)) +    # Set the group aesthetic by a dual-variable set
                  
    ### 5.8.0 Add the points in also to help follow the connections
    geom_point()
## `summarise()` has grouped output by 'sector', 'employer'. You can
## override using the `.groups` argument.
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomLine, : '...' used in an incorrect context

Our plot here connects each employer across time, showing the mean salary of those on the sunshine list, which gives us a sense of the trend for a single employer. The lines are coloured by sector so that we can also compare trends between “Colleges” and “Universities”. For instance, we can see something strange in our data in 2017, where across ALL colleges there appeared to be a large shift in salaries before returning to the regular trajectory.

Similarly, the various universities in our data all follow a similar upward trend in mean salary growth over time. Pretty cool, right?


6.0.0 Class summary

We’ve covered a number of key plots today, including when and how to use them. Next week we’ll revisit some of these plots and spruce them up with extra touches that will take them that extra distance. Below you’ll find a summary of what we’ve discussed today.

6.1.0 Summary of plots

| Plot | Key Features | Notes |
|------|--------------|-------|
| Scatterplot | Good for exploring relationships between variables | Bubbleplots add an extra dimension to your data |
| Barplot | Present values across groups; stack categories for an extra dimension | Presenting proportions, small sample sizes; does not dissect individual distributions |
| Nightingale plot | Circular-wedge barplot, same properties as barplot | Presenting data over unordered groups; visual emphasis on outer area size may mislead the reader |
| Racetrack plot | Circular-ringed barplot, same properties as barplot | A more compact way to show barplots; calculate length by radians as outer rings are “stretched” |
| Density plot | Theoretical distribution of your sample data | Minimum sample size 4, but 30 is more reliable; can plot up to 5 distributions on the same axis; tails can produce “ghost” data |
| Ridgeplot | Allows tighter visualizations of multiple densities; same properties as KDEs | Good way to pack more KDEs into a smaller area; no real control of outliers; similar “ghost” data issues |
| Box and whisker plot | Summarizes distributions with 5 parameters | Popular and compact presentation of simple populations; minimum sample size = 5; does not properly visualize multi-modal data |
| Violin plot | Boxplot format with KDE violin shape | Compact representation of distribution shape; less popular, with nuanced interpretation; inherits “ghost” data and other properties of the KDE; DOES represent multi-modal populations |
| Parallel coordinate plot | Visual representation of multivariate data; related data can be connected linearly | Connects trends across groups; not limited by number of samples; can help identify trends within multicategorical data |

6.2.0 Weekly assignment

This week’s assignment will be found under the current lecture folder under the “assignment” subfolder. It will include an R markdown notebook that you will use to produce the code and answers for this week’s assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown

| Component | Weight | Criteria |
|-----------|--------|----------|
| Code | 50% | Does it follow best practices? Does it make good use of available packages? Was the data prepared properly? |
| Answers and Output | 50% | Is output based on the correct dataset? Are groupings appropriate? Are titles/axes/legends correct? Is the interpretation of the graphs correct? |

Since coding styles and solutions can differ, students are encouraged to use best practices. Well-coded or elegant solutions may be rewarded.

You can save and download the markdown notebook in its native format. Submit this file to the appropriate assignment section by 12:59 pm on the date of our next class: March 27th, 2025.


6.3.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.2: edited and prepared for CSB1020H S LEC0141, 03-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 2.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 3.0.0: Revised and prepared for CSB1020H S LEC0141, 03-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.